Multimodal Machine Learning: Integrating Language, Vision and Speech
Abstract
Multimodal machine learning is a vibrant multidisciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages. From the initial research on audio-visual speech recognition to more recent language-and-vision projects such as image and video captioning and visual question answering, this field poses unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities.
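To make the integration challenge concrete, the sketch below shows one common fusion strategy, late fusion, in Python: each modality is encoded separately and the unimodal predictions are combined with learned weights. This is an illustrative sketch only; the class name, feature dimensions, and architecture are hypothetical and not drawn from the tutorial.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion model: encode each modality separately,
    then combine the unimodal logits with learned weights."""

    def __init__(self, dims, n_classes):
        super().__init__()
        # One small encoder + classifier head per modality.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n_classes))
            for d in dims
        )
        # Learned fusion weights, one scalar per modality.
        self.fusion = nn.Parameter(torch.ones(len(dims)))

    def forward(self, inputs):  # inputs: list of (batch, dim) tensors
        logits = torch.stack([h(x) for h, x in zip(self.heads, inputs)])
        weights = torch.softmax(self.fusion, dim=0)  # normalize over modalities
        return (weights[:, None, None] * logits).sum(dim=0)

# Hypothetical feature sizes for linguistic, acoustic, and visual inputs.
model = LateFusionClassifier(dims=[300, 40, 2048], n_classes=10)
batch = [torch.randn(8, d) for d in (300, 40, 2048)]
print(model(batch).shape)  # torch.Size([8, 10])
```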
Similar resources
On the Integration of Grounding Language and Learning Objects
This paper presents a multimodal learning system that can ground spoken names of objects in their physical referents and learn to recognize those objects simultaneously from naturally co-occurring multisensory input. There are two technical problems involved: (1) the correspondence problem in symbol grounding – how to associate words (symbols) with their perceptually grounded meanings from mult...
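As an illustration of the correspondence problem the abstract mentions, the following sketch implements simple cross-situational learning over symbolic tokens: associations between words and objects are estimated from their co-occurrence across scenes. The episodes and counting scheme are invented for illustration; the paper's system operates on raw multisensory input rather than symbols.

```python
from collections import defaultdict

# Toy cross-situational learner: count how often each word co-occurs
# with each visible object across scenes, then normalize. The data
# below are invented for illustration.
episodes = [
    ({"cup", "ball"}, ["look", "a", "cup"]),
    ({"cup", "dog"},  ["the", "cup", "again"]),
    ({"dog", "ball"}, ["nice", "ball"]),
]

cooc = defaultdict(lambda: defaultdict(int))
word_count = defaultdict(int)
for objects, words in episodes:
    for w in words:
        word_count[w] += len(objects)
        for o in objects:
            cooc[w][o] += 1

def grounded_meaning(word):
    # P(object | word) estimated from co-occurrence frequencies.
    return {o: c / word_count[word] for o, c in cooc[word].items()}

print(grounded_meaning("cup"))   # the cup object scores highest for "cup"
print(grounded_meaning("ball"))
```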
From Speech Recognition to Language and Multimodal Processing
While artificial neural networks have been around for over half a century, it was not until 2010 that they made a significant impact on speech recognition, in the deep form of such networks. This article, based on my keynote talk at the Interspeech conference in Singapore in September 2014, will first reflect on the historical path to this transformative success, after providing brief...
Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading
Audio-visual Speech Recognition (AVSR), which employs both video and audio information to perform Automatic Speech Recognition (ASR), is one application of multimodal learning that makes ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but strict priors limit their ability. With the revival of deep learning, Deep Neural Networks (DNN) be...
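One plausible shape for such a bimodal model (a sketch, not the paper's architecture) is a pair of LSTMs, one per stream, whose final hidden states are concatenated for word classification. All dimensions below are hypothetical.

```python
import torch
import torch.nn as nn

class BimodalAVSR(nn.Module):
    """Sketch of a bimodal AVSR classifier: one LSTM per stream,
    final hidden states concatenated and classified. Feature sizes
    are hypothetical, not taken from the paper."""

    def __init__(self, audio_dim=40, video_dim=512, hidden=128, n_words=500):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.video_lstm = nn.LSTM(video_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_words)

    def forward(self, audio, video):
        # audio: (batch, T_a, audio_dim); video: (batch, T_v, video_dim)
        _, (h_a, _) = self.audio_lstm(audio)
        _, (h_v, _) = self.video_lstm(video)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)  # late fusion of states
        return self.classifier(fused)

model = BimodalAVSR()
logits = model(torch.randn(4, 100, 40), torch.randn(4, 25, 512))
print(logits.shape)  # torch.Size([4, 500])
```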
Multimodal Adaptive Interfaces
Our group is interested in creating human-machine interfaces that use natural modalities such as vision and speech to sense and interpret a user's actions [6]. In this paper we describe recent work on multimodal adaptive interfaces that combine automatic speech recognition, computer vision for gesture tracking, and machine learning techniques. Speech is the primary mode of communication betwe...
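A toy version of the kind of speech-plus-gesture integration described here might resolve a deictic spoken command ("move that") against a pointing gesture, as in the sketch below. The data structures, confidence threshold, and scoring are invented for illustration and are not from the paper.

```python
# Toy resolver for a multimodal interface: combine a speech hypothesis
# with a pointing gesture to resolve deictic commands. Everything here
# is a made-up illustration of the general idea.

def resolve_command(speech, gesture, objects):
    """speech: (transcript, confidence); gesture: 2-D pointing target.
    Picks the object nearest the pointing target when the transcript
    contains a deictic word ("that", "this")."""
    transcript, conf = speech
    if conf < 0.5:
        return None  # too uncertain; a real system might ask the user
    if any(w in transcript.split() for w in ("that", "this")):
        # Ground the deictic reference with the gesture channel.
        target = min(objects, key=lambda o: (o["x"] - gesture[0]) ** 2
                                            + (o["y"] - gesture[1]) ** 2)
        return transcript, target["name"]
    return transcript, None

objects = [{"name": "lamp", "x": 0.2, "y": 0.8},
           {"name": "chair", "x": 0.7, "y": 0.3}]
print(resolve_command(("move that left", 0.9), (0.65, 0.35), objects))
# ('move that left', 'chair')
```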
Recognizing Unfamiliar Gestures for Human-Robot Interaction Through Zero-Shot Learning
Human communication is highly multimodal, including speech, gesture, gaze, facial expressions, and body language. Robots serving as human teammates must act on such multimodal communicative inputs from humans, even when the message may not be clear from any single modality. In this paper, we explore a method for achieving increased understanding of complex, situated communications by leveraging...
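The zero-shot idea can be sketched as follows: project gesture features into a shared semantic space and match them against class embeddings, including classes never seen in training. The projection matrix and embeddings below are random stand-ins, not the paper's learned model.

```python
import numpy as np

# Zero-shot recognition sketch: map gesture features into a semantic
# space and match against class embeddings, including classes with no
# training examples. W and the embeddings below are invented stand-ins.
rng = np.random.default_rng(0)

feat_dim, sem_dim = 64, 16
W = rng.standard_normal((sem_dim, feat_dim))  # stand-in for a learned map

# Semantic embeddings (e.g., word vectors) for all classes, seen or not.
class_embeddings = {
    "wave":      rng.standard_normal(sem_dim),
    "point":     rng.standard_normal(sem_dim),
    "thumbs_up": rng.standard_normal(sem_dim),  # unseen during training
}

def classify(gesture_features):
    z = W @ gesture_features
    z /= np.linalg.norm(z)
    # Cosine similarity against every class embedding.
    scores = {name: float(z @ (e / np.linalg.norm(e)))
              for name, e in class_embeddings.items()}
    return max(scores, key=scores.get)

print(classify(rng.standard_normal(feat_dim)))
```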